{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Cart pole balancing with policy gradient\n",
    "\n",
    "Now, let's learn how to implement the policy gradient algorithm with reward-to-go for the\n",
    "cart pole balancing task.\n",
    "\n",
    "First, let's import the necessary libraries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2.0.0\n"
     ]
    }
   ],
   "source": [
    "import tensorflow as tf\n",
    "print(tf.__version__)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "import numpy as np\n",
    "import gym"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For a clear understanding of how the policy gradient method works, we use\n",
    "TensorFlow in the non-eager mode by disabling TensorFlow 2 behavior"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tensorflow.compat.v1 as tf\n",
    "tf.disable_v2_behavior()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create the cart pole environment using gym:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "env = gym.make('CartPole-v0')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get the state shape:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "state_shape = env.observation_space.shape[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get the number of actions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_actions = env.action_space.n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Computing discounted and normalized reward\n",
    "\n",
    "Instead of using the rewards directly, we can use the discounted and normalized rewards. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define the discount factor, $\\gamma$:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "gamma = 0.95"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's define a function called `discount_and_normalize_rewards` for computing the discounted and normalized rewards:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "def discount_and_normalize_rewards(episode_rewards):\n",
    "    \n",
    "    #initialize an array for storing the discounted reward\n",
    "    discounted_rewards = np.zeros_like(episode_rewards)\n",
    "    \n",
    "    #compute the discounted reward\n",
    "    reward_to_go = 0.0\n",
    "    for i in reversed(range(len(episode_rewards))):\n",
    "        reward_to_go = reward_to_go * gamma + episode_rewards[i]\n",
    "        discounted_rewards[i] = reward_to_go\n",
    "        \n",
    "    #normalize and return the reward\n",
    "    discounted_rewards -= np.mean(discounted_rewards)\n",
    "    discounted_rewards /= np.std(discounted_rewards)\n",
    "    \n",
    "    return discounted_rewards"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Building the policy network\n",
    "\n",
    "First, let's define the placeholder for the state:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "state_ph = tf.placeholder(tf.float32, [None, state_shape], name=\"state_ph\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define the placeholder for the action:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "action_ph = tf.placeholder(tf.int32, [None, num_actions], name=\"action_ph\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define the placeholder for the discounted reward:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "discounted_rewards_ph = tf.placeholder(tf.float32, [None,], name=\"discounted_rewards\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define the layer 1:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "layer1 = tf.layers.dense(state_ph, units=32, activation=tf.nn.relu)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define the layer 2, note that the number of units in the layer 2 is set to the number of\n",
    "actions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "layer2 = tf.layers.dense(layer1, units=num_actions)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Obtain the probability distribution over the action space as an output of the network by\n",
    "applying the softmax function to the result of layer 2:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "prob_dist = tf.nn.softmax(layer2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We learned that we compute gradient as:\n",
    "\n",
    "$$\\nabla_{\\theta} J(\\theta) = \\frac{1}{N} \\sum_{i=1}^{N}\\left[\\sum_{t=0}^{T-1}  \\nabla_{\\theta} \\log \\pi_{\\theta}\\left(a_{t} | s_{t}\\right)R_t\\right] $$\n",
    "    \n",
    "After computing the gradient we update the parameter of the network using the gradient\n",
    "ascent as:    \n",
    "\n",
    "$$\\theta = \\theta + \\alpha \\nabla_{\\theta} J(\\theta) $$\n",
    "\n",
    "However, it is a standard convention to perform minimization rather than maximization.\n",
    "So, we can convert the above maximization objective into the minimization objective by just\n",
    "adding a negative sign.\n",
    "\n",
    "Thus, we can define negative log policy as:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "neg_log_policy = tf.nn.softmax_cross_entropy_with_logits_v2(logits = layer2, labels = action_ph)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's define the loss:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "loss = tf.reduce_mean(neg_log_policy * discounted_rewards_ph) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define the train operation for minimizing the loss using Adam optimizer:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "train = tf.train.AdamOptimizer(0.01).minimize(loss)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training the network\n",
    "\n",
    "Now, let's train the network for several iterations. For simplicity, let's just generate one\n",
    "episode on every iteration.\n",
    "\n",
    "Set the number of iterations:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_iterations = 1000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Iteration:0, Return: 71.0\n",
      "Iteration:10, Return: 12.0\n"
     ]
    }
   ],
   "source": [
    "#start the TensorFlow session\n",
    "with tf.Session() as sess:\n",
    "    \n",
    "    #initialize all the TensorFlow variables\n",
    "    sess.run(tf.global_variables_initializer())\n",
    "    \n",
    "    #for every iteration\n",
    "    for i in range(num_iterations):\n",
    "        \n",
    "        #initialize an empty list for storing the states, actions, and rewards obtained in the episode\n",
    "        episode_states, episode_actions, episode_rewards = [],[],[]\n",
    "    \n",
    "        #set the done to False\n",
    "        done = False\n",
    "        \n",
    "        #initialize the state by resetting the environment\n",
    "        state = env.reset()\n",
    "\n",
    "        #initialize the return\n",
    "        Return = 0\n",
    "\n",
    "        #while the episode is not over\n",
    "        while not done:\n",
    "            \n",
    "            #reshape the state\n",
    "            state = state.reshape([1,4])\n",
    "            \n",
    "            #feed the state to the policy network and the network returns the probability distribution\n",
    "            #over the action space as output which becomes our stochastic policy \n",
    "            pi = sess.run(prob_dist, feed_dict={state_ph: state})\n",
    "            \n",
    "            #now, we select an action using this stochastic policy\n",
    "            a = np.random.choice(range(pi.shape[1]), p=pi.ravel()) \n",
    "            \n",
    "            #perform the selected action\n",
    "            next_state, reward, done, info = env.step(a)\n",
    "            \n",
    "            #render the environment\n",
    "            env.render()\n",
    "            \n",
    "            #update the return\n",
    "            Return += reward\n",
    "            \n",
    "            #one-hot encode the action\n",
    "            action = np.zeros(num_actions)\n",
    "            action[a] = 1\n",
    "            \n",
    "            #store the state, action, and reward into their respective list\n",
    "            episode_states.append(state)\n",
    "            episode_actions.append(action)\n",
    "            episode_rewards.append(reward)\n",
    "\n",
    "            #update the state to the next state\n",
    "            state=next_state                                                                                                                                                                                                                                                                                                                                                                                                                                                                            \n",
    "\n",
    "\n",
    "        #compute the discounted and normalized reward\n",
    "        discounted_rewards= discount_and_normalize_rewards(episode_rewards)\n",
    "        \n",
    "        #define the feed dictionary\n",
    "        feed_dict = {state_ph: np.vstack(np.array(episode_states)),\n",
    "                     action_ph: np.vstack(np.array(episode_actions)), \n",
    "                     discounted_rewards_ph: discounted_rewards \n",
    "                    }\n",
    "                    \n",
    "        #train the network\n",
    "        loss_, _ = sess.run([loss, train], feed_dict=feed_dict)\n",
    "\n",
    "        #print the return for every 10 iteration\n",
    "        if i%10==0:\n",
    "            print(\"Iteration:{}, Return: {}\".format(i,Return))  \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have learned how to implement the policy gradient algorithm with rewardto-go, in the next section, we will learn another interesting variance reduction technique\n",
    "called policy gradient with baseline. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
